@niemela
Member

@niemela niemela commented Dec 19, 2025

Closes #539

I don't feel we had consensus on the issue yet, but this is how it could be done.

Please chime in...

@niemela niemela requested review from Matistjati and Tagl December 19, 2025 02:03
It is good practice to use a numbered prefix such as `00`, `01`, `02`, `03`, and so on, to get the desired order of test cases, while keeping the file names descriptive.
Remember that the numbered prefixes should be zero padded to the same length to get the expected lexicographical order.

Test case file names (the base name of the `.in` file) must be unique across the entire `data/` directory, unless the test cases are equivalent.
Collaborator

Maybe we should be a tiny bit more specific with the definition of equivalence? Surely we don't care about output validator flags in this definition, right?
For any two test cases, if the contents of their .in and .files directory are equivalent, as well as the args sequence in the .yaml file, then the input of the two test cases is equivalent. For any two test cases, if their input, output validator arguments and the contents of their .ans files are equivalent, then the test cases are equivalent.

At the very least, we should say "if their inputs are equivalent". Additionally, we should probably either copy paste the definition or link to it.

Collaborator

@Tagl Tagl Dec 19, 2025

> Surely we don't care about output validator flags in this definition, right?

Agreed, I want to be able to reuse a .in file with different output validator flags.

Member Author

Ok, maybe we do, but then we have two different kinds of equivalence. The one you want to use is "the inputs are equivalent", the one we already have defined and that I used is "the test cases are equivalent". The latter allows a judge system to reuse the results of judging the test case, the former does not. This is why I would like to use that definition.

Collaborator

> Ok, maybe we do, but then we have two different kinds of equivalence.

Yes. For a concrete example: Sweden has a problem asking "find the min and max possible thing for a given input", with subtask "you only need to find the max correctly". I would argue that the most correct solution in this instance is that this property is part of the group via output_validator_flags, not any testcase itself, and we want to be able to reuse them (problem in question: https://po2punkt0.kattis.com/problems/robottavlingen)

Member Author

Ah... I see you have not read the entirety of me and @thorehusfeldt's discussion in #523 😏.

In that discussion the consensus seems to go towards output_validator_flags being part of "the test case". I think @thorehusfeldt is arguing from a point of "the sameness of a test case should imply the sameness of the judgement of said test case", and I would agree with that. It feels strange to say that you could pass a test case and then fail "the same" test case? They are quite obviously not the same then.

So, IMO, what you are talking about is identical input, not identical test cases. I would argue that that can be sufficiently handled by symlinks or copying?

Contributor

@mzuenni mzuenni Jan 28, 2026

I am a bit late to the party but I have to agree with @Matistjati and @Tagl. I would go even further and say that the current proposal sounds pretty useless and does not fit the workflow used in the past. Let me explain why:

> the sameness of a test case should imply the sameness of the judgement of said test case

This is the exact part that makes this proposal useless. If the judgement is the same is there even a need for the "same test case" in the first place?

I would also argue that this statement is fundamentally broken, since the judgement heavily depends on the submission which is not necessarily deterministic?

> It feels strange to say that you could pass a test case and then fail "the same" test case? They are quite obviously not the same then.

Even with your definition that could happen for non deterministic submissions?

> what you are talking about is identical input, not identical test cases. I would argue that that can be sufficiently handled by symlinks?

Yes that is what I would be talking about since identical inputs actually appear, but this proposal would force me to name the link differently than the file that it points to? IMO this is very bad design.

Suppose you have multiple test groups (A, B, C, ...) that should all include the file easy/1.in. Then you would end up with these weird symlinks:

  • A/1a.in -> easy/1.in
  • B/1b.in -> easy/1.in
  • C/1c.in -> easy/1.in

In the past we could just call them all 1.in, which made clear that they are the same .in file.
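
For reference, the layout described here (every group linking to easy/1.in under the same name) can be reproduced with a short Python sketch. The `build_layout` helper and the file contents are hypothetical, and the script assumes a platform where unprivileged symlinks are available.

```python
import os
import tempfile
from pathlib import Path


def build_layout(root: Path, groups=("A", "B", "C")) -> list[Path]:
    """Create data/easy/1.in and one same-named symlink to it per group."""
    easy = root / "data" / "easy"
    easy.mkdir(parents=True)
    (easy / "1.in").write_text("1 2\n")
    links = []
    for group in groups:
        group_dir = root / "data" / group
        group_dir.mkdir()
        link = group_dir / "1.in"
        # Relative target, so the package can be moved without breaking links.
        os.symlink(os.path.join("..", "easy", "1.in"), link)
        links.append(link)
    return links


with tempfile.TemporaryDirectory() as tmp:
    links = build_layout(Path(tmp))
    # Every link resolves to the same underlying file.
    print(len({link.resolve() for link in links}))
```

Whether such same-named links stay legal depends on which notion of "same" the uniqueness rule adopts.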

> In that discussion the consensus seems to go towards output_validator_flags being part of "the test case"

I would strongly advise against this. I always considered all files with the same base name to be part of the test case and nothing else. So if the output_validator_flags live in <test case>.yaml they are part of the test case but if they come from the test_group.yaml they are not.

Member Author

> I am a bit late to the party but I have to agree with @Matistjati and @Tagl. I would go even further and would say that current proposal sounds pretty useless and does not fit the workflow used in the past. Let me explain why:
>
> > the sameness of a test case should imply the sameness of the judgement of said test case
>
> This is the exact part that makes this proposal useless. If the judgement is the same is there even a need for the "same test case" in the first place?

I feel we are talking past each other here. I'm trying to define "same" here, how is "is there even a need for ..." relevant, and what do you mean by it?

I would argue that there are 2 reasonable definitions of "same" (or maybe we should use the term "identical", but I digress) we could use:

  1. identity, defined by the base name (including path), i.e. no two test cases are the "same", because, if nothing else, they have different names.
  2. contents, i.e. all information that affects judging for the test cases is the same. This implies that, assuming deterministic submissions, the judgements will always be the same for "same" test cases. (This is the definition I'm suggesting).

I am arguing against a definition that does not include some information that affects judging. Meaning that we would say that two test cases are the "same", while not expecting them to be judged the same (even if assuming deterministic submissions). I don't think you want that either, so I think we agree on this part? @mzuenni?

> I would also argue that this statement is fundamentally broken, since the judgement heavily depends on the submission which is not necessarily deterministic?

The problem format explicitly allows systems to assume determinism though. So, I strongly disagree that the statement is "fundamentally broken", maybe a bit sloppy, there should be an added "...assuming deterministic submissions".

> > It feels strange to say that you could pass a test case and then fail "the same" test case? They are quite obviously not the same then.
>
> Even with your definition that could happen for non deterministic submissions?

Yes, but as stated above, systems may assume determinism, and also, is that a problem?

> > what you are talking about is identical input, not identical test cases. I would argue that that can be sufficiently handled by symlinks?
>
> Yes that is what I would be talking about since identical inputs actually appear, but this proposal would force me to name the link differently than the file that it points to?

Only if you want identical input and different other settings. Is that what you want?

> IMO this is very bad design.

Why?

> Suppose you have multiple test groups (A, B, C, ...) that should all include the file easy/1.in. Then you would end up with these weird symlinks:
>
>   • A/1a.in -> easy/1.in
>   • B/1b.in -> easy/1.in
>   • C/1c.in -> easy/1.in

(I think you meant to write A/B/C and not A/A/A above, right? I fixed it above.)

This is only the case if you want some other parts of the test case to be different. Why do you want/need that? (That's not a rhetorical question, I'm not implying that you couldn't reasonably want that, I know of some reasons to want it, I'm wondering what use cases you have in mind.)

> In the past we could just call them all 1.in, which made clear that they are the same .in file.

Well, currently, it does not make it clear that they are the same file, there is no such requirement. You could have a file named D/1.in with completely different contents, so it's not safe at all to assume that files named the same are the same.

The suggestion here would be to make such a requirement, so that you could make such assumptions. I.e. the same (local) name to imply "same" test case.

> > In that discussion the consensus seems to go towards output_validator_flags being part of "the test case"
>
> I would strongly advise against this. I always considered all files with the same base name to be part of the test case and nothing else. So if the output_validator_flags live in <test case>.yaml they are part of the test case but if they come from the test_group.yaml they are not.

Wow... this last part is IMO crazy! You are saying that output_validator_flags is sometimes part of the test case and sometimes not?!? Why would you possibly want that definition?

Contributor

> The problem format explicitly allows systems to assume determinism though.

> Yes, but as stated above, systems may assume determinism, and also, is that a problem?

Ok, I did not know that but that makes things even worse. IMO the judging system should not be allowed to do that. That not only easily breaks assumptions of problem setters but also of participants?

  1. Suppose I know there is a very reasonable submission that fails with a 50% chance on a very specific hand crafted test. I as problem setter would duplicate that test case to make the submission fail with high probability (this is not the best strategy but happened in the past and cannot always be avoided).
    If the judging system decides to only run one instance of this test case this is bad... I obviously wanted the test to be run multiple times. Why else should it be there?
  2. Even worse: participants who know about this get an advantage over participants who don't. This makes writing a randomized solution where you know only a few breaking cases can exist a valid strategy.
  3. Does this also imply that if the judging system sees an identical submission (identical source code) it can decide to not run it again?? This would be fundamentally breaking a lot of problems we had in the past... Actually it would break every problem where a randomized submission is intended.
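
The effect in point 1 can be put in numbers with a small Python sketch. It assumes that each run fails independently with the same probability, which is a simplification; `pass_probability` is a hypothetical helper, not part of any judge.

```python
def pass_probability(fail_chance: float, copies: int) -> float:
    """Chance that a submission survives `copies` independent runs of one test."""
    return (1.0 - fail_chance) ** copies


# A submission that fails the hand-crafted test half of the time.
for copies in (1, 5, 10):
    print(copies, pass_probability(0.5, copies))
```

With ten copies the submission gets through roughly 0.1% of the time, but if the judge collapses equivalent test cases into a single run, the chance goes back to 50%.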

> Well, currently, it does not make it clear that they are the same file, there is no such requirement. You could have a file named D/1.in with completely different contents [...].

Yes I know that there is no such requirement but that is not the issue I wanted to point out. My problem is very much the other way around: with this proposal I would need to make the name of a symlink different to the name of the file it points to?

> Wow... this last part is IMO crazy!. You are saying that output_validator_flags is sometimes part of the test case and sometimes not?!? Why would you possibly want that definition?

Yes and No. What I want to say is that if we make file names unique then the name of a test case should only refer to files associated with that name? (so <test case> refers to all files that have the form <test case>.<ext>)
If we want some kind of uniqueness constraints for names of test cases I would want the following meaning "If two files have the same base name + extension they should have the same content".

If that is not what you want I would argue in favor of not adding such a restriction at all.
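
A check for the rule "same base name + extension implies same content" could look roughly like the Python sketch below. `conflicting_names` is a hypothetical helper, and restricting it to `.in` and `.ans` files is an assumption made here so that group-level YAML files, which naturally share a name across directories, are not flagged.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def conflicting_names(data_dir, suffixes=(".in", ".ans")):
    """Map each file name to its paths when copies with that name differ in content."""
    by_name = defaultdict(lambda: defaultdict(list))
    for path in sorted(data_dir.rglob("*")):
        # is_file() follows symlinks, so a link is compared by its target's content.
        if path.is_file() and path.suffix in suffixes:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_name[path.name][digest].append(path)
    return {
        name: sorted(p for paths in digests.values() for p in paths)
        for name, digests in by_name.items()
        if len(digests) > 1
    }


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    files = {"A/1.in": "1 2\n", "B/1.in": "1 2\n", "D/1.in": "9 9\n", "A/2.in": "3\n"}
    for rel, text in files.items():
        target = data / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text)
    print(sorted(conflicting_names(data)))  # ['1.in'], because D/1.in differs
```

Because symlinks are compared by content, same-named links such as A/1.in -> easy/1.in would pass this check, while an unrelated D/1.in would be reported.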

Contributor

@mzuenni mzuenni Jan 29, 2026

> The problem format explicitly allows systems to assume determinism though.

One more point regarding this: This was not true for the legacy format, right? So if I upload a problem as legacy it cannot be cached but with the new format it can... That is terrible for every user group...

I even remember BAPC problems where the input was randomly generated... so this would no longer be possible at all??

Collaborator

> The problem format explicitly allows systems to assume determinism though.

This is also broken in a different way: it breaks every single randomized submission. The judge is basically allowed to rerun a submission an infinite number of times and check that it fails 0 times (so basically until it fails). Clearly that is not what we want right?

We very specifically mean: run this test case, and run it exactly once.

So I guess the discussion here is: can we change it to at most once for 'identical' testcases (for some definition of identical).

I would suggest that every data/**/*.in corresponds to exactly 1 run, as I have always understood it.

If you want to avoid this, use require_pass: easier_group instead to be explicit.

> If we want some kind of uniqueness constraints for names of test cases I would want the following meaning "If two files have the same base name + extension they should have the same content".

This sounds reasonable. Note that this does not imply the converse ("if two files have the same content, they should have the same base name").

Development

Successfully merging this pull request may close these issues.

Stricter limits on naming test cases

6 participants